A method of estimating the pitch of a speech signal using
previous estimates, use of the method, and a device 
adapted therefor 



The invention relates to a method of estimating the pitch 
of a speech signal, said method being of the type where 
the speech signal is divided into segments, a conformity 
function for the signal is calculated for each segment,
and peaks in the conformity function are detected. The 
invention also relates to the use of the method in a mo- 
bile telephone. Further, the invention relates to a de- 
vice adapted to estimate the pitch of a speech signal. 

In many speech processing systems it is desirable to know 
the pitch period of the speech. As an example, several 
speech enhancement algorithms are dependent on having a 
correct estimate of the pitch period. One field of appli- 
cation where speech processing algorithms are widely used 
is in mobile telephones.

A well known way of estimating the pitch period is to use 
the autocorrelation function, or a similar conformity
function, on the speech signal. An example of such a 
method is described in the article D. A. Krubsack, R. J. 
Niederjohn, "An Autocorrelation Pitch Detector and Voic- 
ing Decision with Confidence Measures Developed for 
Noise-Corrupted Speech", IEEE Transactions on Signal
Processing, vol. 39, no. 2, pp. 319-329, Feb. 1991. The
speech signal is divided into segments of 51.2 ms, and 
the standard short-time autocorrelation function is cal- 
culated for each successive speech segment. A peak pick- 
ing algorithm is applied to the autocorrelation function 
of each segment. This algorithm starts by choosing the 
maximum peak (largest value) in the pitch range of 50 to
333 Hz. The period corresponding to this peak is selected
as an estimate of the pitch period. 



However, such a basic pitch estimation algorithm is not 
sufficient. In some cases pitch doubling or pitch halving 
can occur, i.e. the highest peak appears at either half 
the pitch period or twice the pitch period. The highest 
peak may also appear at another multiple of the true 
pitch period. In these cases a simple selection of the 
maximum peak will provide a wrong estimate of the pitch 
period. 

The above-mentioned article also discloses a method of 
improving the algorithm in these situations. The algo-
rithm checks for peaks at one-half, one-third, one-
fourth, one-fifth, and one-sixth of the first estimate of
the pitch period. If the half of the first estimate is 
within the pitch range, the maximum value of the autocor- 
relation within an interval around this half value is lo- 
cated. If this new peak is greater than one-half of the 
old peak, the new corresponding value replaces the old 
estimate, thus providing a new estimate which is presuma- 
bly corrected for the possibility of the pitch period 
doubling error. This test is performed again to check for 
double doubling errors (fourfold errors). If this most
recent test fails, a similar test is performed for tri- 
pling errors of this new estimate. This test checks for 
pitch period errors of sixfold. If the original test 
failed, the original estimate is tested (in a similar 
manner) for tripling errors and errors of fivefold. The 
final value is used to calculate the pitch estimate. 

However, this known algorithm is rather complex and re- 
quires a high number of calculations, and these drawbacks 
make it less usable in real time environments on small 
digital signal processors as they are used in mobile 
telephones and similar devices. Further, the algorithm
only checks for pitch doubling, pitch tripling, etc., 
while pitch halving is not considered. Actually, if a 
peak is present at the half of the true pitch period, the 
algorithm would (wrongly) choose that peak as the esti- 
mate of the pitch period. 

Thus, it is an object of the invention to provide a 
method of the above-mentioned type which is less complex 
than the prior art methods, such that the method is suit-
able for small digital signal processors. Further, the
method should also avoid the pitch halving situation.

According to the invention, this object is achieved in 
that the method further comprises the steps of calculat- 
ing an average value of pitch estimates estimated in a 
number of previous segments, calculating for each peak in 
the conformity function the difference between the posi- 
tion of the peak and said average value, and using the 
position of the peak having the smallest value of said 
difference as an estimate of the pitch. 

In the situation where previously detected pitch period 
estimates are available, which will often be the case, a 
small difference is expected between the correct pitch 
period and the average of the previous pitch periods. 
This is due to the fact that the pitch period only varies 
a little while a person is talking. Therefore, the peak 
which is closest to the average of the estimates of the 
previous segments is most likely to be the correct pitch 
and will thus be the best estimate. By simply selecting 
this peak much computation is avoided and a simple algo- 
rithm is achieved. 
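
The selection step described above can be sketched as
follows. This is a minimal illustration only; the function
name closest_peak and the example numbers are assumptions
made for the sketch and are not taken from the description.

```python
def closest_peak(peak_lags, previous_estimates):
    """Return the peak lag closest to the mean of earlier pitch estimates."""
    if not peak_lags or not previous_estimates:
        raise ValueError("need at least one peak and one previous estimate")
    average = sum(previous_estimates) / len(previous_estimates)
    # The smallest absolute difference marks the most plausible peak.
    return min(peak_lags, key=lambda lag: abs(lag - average))


# Hypothetical peaks at half, true and double pitch period (in samples),
# with earlier estimates scattered around 42 samples.
print(closest_peak([21, 42, 84], [41, 43, 42, 44]))
```

No search around fractions or multiples of a first estimate
is needed; one subtraction per peak suffices.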

When the method further comprises the steps of sampling 
the speech signal to obtain a series of samples, and per-
forming the division into segments such that each segment
has a fixed number of consecutive samples, an even less 
complex method is achieved because only a finite number 
of samples has to be considered. 

When the method further comprises the steps of estimating 
a set of filter parameters using linear predictive analy-
sis (LPA), providing a modified signal by filtering the
speech signal through a filter based on this estimated
set of filter parameters, and calculating the conformity
function of the modified signal, much of the smearing of
the original speech signal is removed and thus the possi- 
bility of clearer peaks in the conformity function is im- 
proved, which results in a more precise estimation of the 
pitch period. 

An expedient embodiment of the invention is achieved when 
the conformity function is calculated as an autocorrela-
tion function. However, it should be noted that other
conformity functions may also be utilized, e.g. a cross
correlation between the original speech signal and the
above-mentioned modified signal.

If the peak having the smallest value of the difference 
is represented by a number of samples, the best estimate 
is achieved when the sample having the maximum amplitude 
of the conformity function is selected as the estimate of 
the pitch. 

In an expedient embodiment of the invention the method is 
used in a mobile telephone, which is a typical example of 
a device having only limited computational resources.

As mentioned, the invention further relates to a device 
adapted to estimate the pitch of a speech signal. The de-
vice comprises means for dividing the speech signal into
segments, means for calculating for each segment a con-
formity function for the signal, and means for detecting
peaks in the conformity function. When the device is fur- 
ther adapted to calculate an average value of pitch esti- 
mates estimated in a number of previous segments, to cal- 
culate for each peak in the conformity function the dif- 
ference between the position of the peak and said average 
value, and to use the position of the peak having the 
smallest value of said difference as an estimate of the 
pitch, a device less complex than prior art devices is 
achieved, which also avoids the pitch halving situation. 

When the device further comprises means for sampling the 
speech signal to obtain a series of samples, and means 
for performing said division into segments such that each 
segment has a fixed number of consecutive samples, an 
even less complex device is achieved because only a fi- 
nite number of samples has to be considered. 

When the device further comprises means for estimating a 
set of filter parameters using linear predictive analysis 
(LPA), means for providing a modified signal by filtering
the speech signal through a filter based on this esti- 
mated set of filter parameters, and means for calculating 
the conformity function of the modified signal, much of 
the smearing of the original speech signal is removed and 
thus the possibility of clearer peaks in the conformity 
function is improved, which results in a more precise es- 
timation of the pitch period. 

An expedient embodiment of the invention is achieved when 
the conformity function is an autocorrelation function.
However, it should be noted that other conformity
functions may also be utilized, e.g. a cross correla-
tion between the original speech signal and the above-
mentioned modified signal.



If the peak having the smallest value of the difference 
is represented by a number of samples, the best estimate 
is achieved when the sample having the maximum amplitude 
of the conformity function is selected as the estimate of 
the pitch. 

In an expedient embodiment of the invention, the device 
is a mobile telephone, which is a typical example of a 
device having only limited computational resources. 

In another embodiment the device is an integrated circuit 
which can be used in different types of equipment. 

The invention will now be described more fully below with 
reference to the drawing, in which 

figure 1 shows a block diagram of a pitch detector ac- 
cording to the invention, 

figure 2 shows the generation of a residual signal, 

figure 3a shows a 20 ms segment of a voiced speech sig-
nal,

figure 3b shows the autocorrelation function of a resid-
ual signal corresponding to the segment of figure 3a,

figure 4 shows an example of an autocorrelation function
where pitch doubling could arise, and 

figure 5 shows an example of the calculation of the dis-
tance between peaks in an autocorrelation function.

Figure 1 shows a block diagram of an example of a pitch 
detector 1 according to the invention. A speech signal 2 
is sampled with a sampling rate of 8 kHz in the sampling
circuit 3 and the samples are divided into segments or 
frames of 160 consecutive samples. Thus, each segment 
corresponds to 20 ms of the speech signal. This is the 
sampling and segmentation normally used for the speech 
processing in a standard mobile telephone. 
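
The segmentation can be illustrated with the short sketch
below. The sampling rate and frame length are those stated
above; the function name split_into_frames and the choice
to drop a trailing partial frame are assumptions of the
sketch.

```python
SAMPLE_RATE_HZ = 8000   # sampling rate stated in the text
FRAME_LENGTH = 160      # samples per segment stated in the text


def split_into_frames(samples, frame_length=FRAME_LENGTH):
    """Split a sample sequence into consecutive full-length frames.

    Dropping a trailing partial frame is an assumption of this sketch.
    """
    count = len(samples) // frame_length
    return [samples[i * frame_length:(i + 1) * frame_length]
            for i in range(count)]


# 160 samples at 8000 samples per second last 20 ms.
frame_duration_ms = 1000.0 * FRAME_LENGTH / SAMPLE_RATE_HZ
```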

Each segment of 160 samples is then processed in a filter 
4, which will be described in further detail below. 

First, however, the nature of speech signals will be men- 
tioned briefly. In a classical approach a speech signal 
is modelled as an output of a slowly time-varying linear 
filter. The filter is either excited by a quasi-periodic
sequence of pulses or random noise depending on whether a
voiced or an unvoiced sound is to be created. The pulse
train which creates voiced sounds is produced by pressing
air out of the lungs through the vibrating vocal cords. 
The period of time between the pulses is called the pitch 
period and is of great importance for the singularity of 
the speech. On the other hand, unvoiced sounds are gener-
ated by forming a constriction in the vocal tract and
producing turbulence by forcing air through the constric-
tion at a high velocity. This description deals with the
detection of the pitch period of voiced sounds, and thus
unvoiced sounds will not be further considered.

As speech is a varying signal, the filter also has to be
time-varying. However, the properties of a speech signal
change relatively slowly with time. It is reasonable to 
believe that the general properties of speech remain 
fixed for periods of 10-20 ms. This has led to the basic
principle that if short segments of the speech signal are 
considered, each segment can effectively be modelled as 
having been generated by exciting a linear time-invariant
system during that period of time. The effect of the fil-
ter can be seen as caused by the vocal tract, the tongue,
the mouth and the lips. 

As mentioned, voiced speech can be interpreted as the 
output signal from a linear filter driven by an excita- 
tion signal. This is shown in the upper part of figure 2 
in which the pulse train 21 is processed by the filter 22 
to produce the voiced speech signal 23. A good signal for 
the detection of the pitch period is obtained if the ex- 
citation signal can be extracted from the speech. By es- 
timating the filter parameters A in the block 24 and then 
filtering the speech through an inverse filter 25 based
on the estimated filter parameters, a signal 26 similar
to the excitation signal can be obtained. This signal is 
called the residual signal. This process is shown in the 
lower part of figure 2. The blocks 24 and 25 are included
in the filter 4 in figure 1. 

The estimation of the filter parameters is based on an 
all-pole modelling which is performed by means of the
method called linear predictive analysis (LPA). The name
comes from the fact that the method is equivalent to
linear prediction. This method is well known in the art 
and will not be described in further detail here. 
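
For orientation only, one common way of carrying out such
an analysis is sketched below. The description does not
state which LPA variant or model order is used, so the
autocorrelation method with the Levinson-Durbin recursion,
the order of 10 and all function names are assumptions of
this sketch.

```python
def autocorr(x, max_lag):
    """Autocorrelation values r[0..max_lag] of one frame."""
    n = len(x)
    return [sum(x[i] * x[i + k] for i in range(n - k))
            for k in range(max_lag + 1)]


def lpc_coefficients(frame, order=10):
    """Predictor coefficients a[1..order] with x[n] ~ sum(a[k] * x[n-k]).

    Levinson-Durbin recursion; the order is an assumed value.
    """
    r = autocorr(frame, order)
    if r[0] == 0.0:
        return [0.0] * order
    a = [0.0] * (order + 1)
    error = r[0]
    for i in range(1, order + 1):
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / error
        updated = a[:]
        updated[i] = k
        for j in range(1, i):
            updated[j] = a[j] - k * a[i - j]
        a = updated
        error *= 1.0 - k * k
        if error <= 0.0:
            break  # frame is (numerically) perfectly predictable
    return a[1:]


def residual(frame, coeffs):
    """Inverse-filter the frame: e[n] = x[n] - sum(a[k] * x[n-k])."""
    out = []
    for n in range(len(frame)):
        prediction = sum(c * frame[n - k]
                         for k, c in enumerate(coeffs, start=1)
                         if n - k >= 0)
        out.append(frame[n] - prediction)
    return out
```

Applied to a frame that was produced by a single pulse
driving a one-pole filter, the residual is close to the
pulse itself, which is the behaviour the inverse filter 25
is intended to have.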

The estimation of the pitch is based on the autocorrela- 
tion of the residual signal, which is obtained as de- 
scribed above. Thus, the output signal from the filter 4 
is taken to an autocorrelation calculation unit 5. Figure
3a shows an example of a 20 ms segment of a voiced speech
signal and figure 3b the corresponding autocorrelation 
function of the residual signal. It will be seen from 
figure 3a that the actual pitch period is about 5.25 ms 
corresponding to 42 samples, and thus the pitch estima- 
tion should end up with this value. 

The next step in the estimation of the pitch is to apply
a peak picking algorithm to the autocorrelation function 
provided by the unit 5. This is done in the peak detector
6 which identifies the maximum peak (i.e. the largest 
value) in the autocorrelation function. The index value, 
i.e. the sample number or the lag, of the maximum peak is 
then used as a preliminary estimate of the pitch period. 
In the case shown in figure 3b it will be seen that the 
maximum peak is actually located at a lag of 42 samples. 
The search of the maximum peak is only performed in the 
range where a pitch period is likely to be located. In 
this case the range is set to 60-333 Hz.
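
The preliminary estimate can be sketched as follows. The
sampling rate and the pitch range are those given above;
the function names, the unnormalised form of the
autocorrelation and the rounding of the range limits to
whole lags are assumptions of the sketch.

```python
import math


def autocorrelation(x, max_lag):
    """Short-time autocorrelation r[k] for k = 0..max_lag."""
    n = len(x)
    return [sum(x[i] * x[i + k] for i in range(n - k))
            for k in range(max_lag + 1)]


def lag_range(sample_rate_hz, f_min_hz, f_max_hz):
    """Whole lags (in samples) whose pitch lies within the given range."""
    lowest = math.ceil(sample_rate_hz / f_max_hz)
    highest = math.floor(sample_rate_hz / f_min_hz)
    return lowest, highest


def preliminary_pitch(signal, sample_rate_hz=8000,
                      f_min_hz=60, f_max_hz=333):
    """Lag of the largest autocorrelation value inside the pitch range."""
    lo, hi = lag_range(sample_rate_hz, f_min_hz, f_max_hz)
    hi = min(hi, len(signal) - 1)
    r = autocorrelation(signal, hi)
    return max(range(lo, hi + 1), key=lambda k: r[k])
```

At 8 kHz the range of 60-333 Hz corresponds to lags of 25
to 133 samples under this rounding, and a lag of 42 samples
corresponds to 42 / 8000 s = 5.25 ms.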

However, this basic pitch estimation algorithm is not al- 
ways sufficient. In some cases pitch doubling or halving 
may occur, i.e. due to distortion the peak in the auto- 
correlation function corresponding to the true pitch pe- 
riod is not the highest peak, but instead the highest 
peak appears at either half the pitch period or twice the 
pitch period. The highest peak could also appear at other 
multiples of the actual pitch period (pitch tripling, 
etc.) although this occurs relatively rarely. A typical 
example where pitch doubling would arise is shown in fig-
ure 4 which again shows the autocorrelation function of
the residual signal. Here too, the correct pitch period 
would be around 42 samples, but the peak at twice the 
pitch period, i.e. around 84 samples, is actually higher 
than the one at 42 samples. The basic pitch estimation 
algorithm would therefore estimate the pitch period as 84
samples and pitch doubling would thus occur. It will also 
be seen that two smaller peaks are located around half 
the pitch period, and in some cases one of these could be 
higher than the correct peak and pitch halving would oc-
cur.

To avoid the problem of pitch doubling and halving the
pitch detection algorithm is therefore improved as de- 
scribed below. 

After the preliminary pitch estimate has been determined, 
it is checked in the risk check unit 7 whether there is 
any risk of pitch halving or pitch doubling. All peaks 
with a peak value higher than 75% of the maximum peak are 
detected and the further processing depends on the result 
of this detection. If only one peak is detected, i.e. the 
original maximum peak, there is no need to perform a 
process to avoid pitch doubling and pitch halving. In
this situation the preliminary pitch estimate is used as 
the final pitch estimate. If, however, more than one peak 
is detected, there is a risk of pitch doubling or pitch 
halving, and a further algorithm must be performed to en-
sure that the correct peak is selected as the pitch esti-
mate.
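
A possible form of the risk check is sketched below. The
75% ratio is taken from the text; the function name
risk_check, and the rule that indexes within five samples
of the maximum belong to the same peak, are assumptions
made for the sketch.

```python
def risk_check(r, lo, hi, ratio=0.75, same_peak_width=5):
    """Return (strong_indexes, risk) for autocorrelation values r.

    strong_indexes are the lags in lo..hi whose value exceeds
    ratio * maximum. risk is True when at least one of them lies
    outside the immediate neighbourhood of the maximum, i.e. when
    more than one peak is present (neighbourhood width is assumed).
    """
    lags = range(lo, hi + 1)
    best = max(lags, key=lambda k: r[k])
    strong = [k for k in lags if r[k] > ratio * r[best]]
    others = [k for k in strong if abs(k - best) >= same_peak_width]
    return strong, bool(others)
```

When risk is False the preliminary estimate is kept as the
final estimate; otherwise the strong indexes are passed on
for further processing.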

Two different solutions to such an algorithm will be de-
scribed. One solution, which is performed in the unit 8,
is used when pitch estimates are available from a number 
of previous segments, while the other solution, which is 
performed in the unit 9, is used when such estimates are 
not available, which will be the case in the beginning of 
a speech signal. The latter solution is described first. 

In cases where no previously estimated pitch periods are 
available, the procedure to avoid pitch doubling and
pitch halving is based on the fact that the identified 
peaks show a periodic behaviour. Actually it can be said 
that the pitch period simply corresponds to the distance 
between the peaks. Index values, i.e. the lag, of the de- 
tected peaks are sorted into groups depending on how 
close to each other the indexes are. In many cases a peak 
can be represented by more than one index, i.e. more than 
one sample, resulting in several indexes around a peak
being detected. Indexes with a distance of less than e.g. 
five samples are sorted into the same group. 
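
The grouping can be sketched as follows. The limit of five
samples is the example value from the text; the function
name group_indexes, and measuring the distance to the
previous index in the group rather than to the group
centre, are assumptions of the sketch.

```python
def group_indexes(indexes, max_gap=5):
    """Sort indexes into groups of neighbouring lags.

    An index joins the current group when it is fewer than max_gap
    samples away from the previous index (assumed interpretation).
    """
    groups = []
    for k in sorted(indexes):
        if groups and k - groups[-1][-1] < max_gap:
            groups[-1].append(k)
        else:
            groups.append([k])
    return groups


# Hypothetical indexes around peaks near 42, 84 and 126 samples.
print(group_indexes([41, 42, 43, 83, 84, 126]))
```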

For each group an average is calculated and then differ- 
ences (distances) between the averaged indexes are calcu- 
lated. The difference towards zero is also calculated 
since the first peak may be the actual pitch period. If 
the detected peaks represent the periodic behaviour of 
the speech signal in the current segment the differences 
between the groups ought to be about the same. 

Therefore, if the variance of the differences between the 
groups is below a given threshold, e.g. 10, the average 
of the differences, i.e. the average distance, is assumed 
to be approximately the pitch period and is thus used as 
a secondary estimate of the pitch period. The variance 
threshold can be set from watching probable differences 
between mean values and their variance. 

An example of this procedure is shown in figure 5 in 
which level I shows the received indexes of the highest 
peaks. In level II the indexes are sorted into groups and 
the mean values of the groups are calculated in level 
III. The differences between mean values are shown in 
level IV and finally, the variance is calculated in level 
V. 

The average distance may be used directly as the pitch 
estimate, or the method can be improved by subtracting
the average distance from each of the average indexes
representing different groups (level III). The group
giving the smallest result of this subtraction, i.e. the
group closest to the average distance, is selected as the
pitch estimate.



If, however, the variance is above the threshold, it 
means that the distances between peaks are too different 
to represent the periodic behaviour of the signal. In 
this case the method cannot be used and the preliminary 
pitch estimate is maintained as the best estimate.
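
The procedure based on the distance between the peaks may
be summarised in the sketch below. The variance threshold
of 10 is the example value from the text; the function
name, the use of the population variance and the return of
an unrounded group average are assumptions of the sketch.

```python
from statistics import mean, pvariance


def distance_based_estimate(groups, preliminary, variance_threshold=10.0):
    """Estimate the pitch period from the spacing of grouped peak indexes.

    groups      -- lists of indexes, one list per detected peak
    preliminary -- lag of the maximum peak, kept if the method fails
    """
    centres = [float(mean(g)) for g in groups]       # level III
    # The distance from lag zero to the first group is included,
    # since the first peak may itself be the pitch period.
    points = [0.0] + centres
    distances = [b - a for a, b in zip(points, points[1:])]  # level IV
    if pvariance(distances) >= variance_threshold:           # level V
        # Spacing too irregular: keep the preliminary estimate.
        return preliminary
    average_distance = mean(distances)
    # Refinement: select the group closest to the average distance.
    return min(centres, key=lambda c: abs(c - average_distance))
```

With hypothetical groups around 42 and 84 samples and a
preliminary estimate of 84, the distances are 42 and 42,
the variance is zero, and the group at 42 is selected, so
that the pitch doubling is corrected.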

When this method has been used for a number of consecu- 
tive segments, and if the pitch estimates for these seg- 
ments are stored in a memory, these previous estimates 
may be used in a different method of avoiding pitch dou-
bling and pitch halving. This method is described below.

First, an average of the previous pitch estimates from 
e.g. the last 15 segments is calculated. This value is
then subtracted from the index values where the highest
peaks in the autocorrelation function of the residual
signal are located, which means that the differences be-
tween the index values of the highest peaks and the aver-
age of the previously detected pitch periods are calcu-
lated. Since the pitch period for a given person is rela-
tively constant over time, a small difference between the
correct pitch period of the current segment and the aver-
age of the previous pitch estimates is expected. There-
fore, those values in the resulting vector of subtraction
results that are below a given threshold, e.g. 10, are
selected. The use of the threshold is due to the fact
that the pitch period may actually vary slightly while a
person is talking, and therefore such a difference has to
be accepted. The actual threshold can be set from watch-
ing probable examples.

If only one difference is below the threshold the corre- 
sponding index value or lag is selected as the estimate 
of the pitch period. If more than one difference is below 
the threshold, the one with the highest amplitude in the
autocorrelation of the residual signal is selected. If 
there are no differences below the threshold, this indi-
cates that the pitch has changed drastically, as may e.g.
be the case when switching speakers. In such a case
the preliminary pitch estimate is maintained as the best
estimate.
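
The method utilizing previous estimates may be sketched as
follows. The history of 15 segments and the threshold of
10 are the example values from the text; the function
name, the parameter names and the comparison of absolute
differences against the threshold are assumptions of the
sketch.

```python
def estimate_from_history(peak_lags, r, history, preliminary,
                          max_difference=10, history_length=15):
    """Select the pitch lag using the average of earlier estimates.

    peak_lags   -- index values of the highest peaks
    r           -- autocorrelation values of the residual signal
    history     -- earlier pitch estimates, oldest first
    preliminary -- lag of the maximum peak, used as fallback
    """
    recent = history[-history_length:]
    if not recent:
        return preliminary
    average = sum(recent) / len(recent)
    # Keep the peaks whose distance to the average is below the threshold.
    close = [k for k in peak_lags if abs(k - average) < max_difference]
    if not close:
        # Drastic pitch change, e.g. when switching speakers.
        return preliminary
    # One candidate is returned directly; among several, the one with
    # the highest autocorrelation amplitude is chosen.
    return max(close, key=lambda k: r[k])
```

Each segment requires only one average, one subtraction
per peak and one comparison per peak, which is consistent
with the low complexity claimed for this method.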

This method utilizing previous estimates is considerably 
less complex than the other one based on the distance be- 
tween the peaks, and therefore it should be used as soon 
as there are sufficient previous estimates in order to
reduce the needed amount of computational resources. 

As mentioned above, one example of equipment in which the 
invention can be implemented is a mobile telephone. The 
algorithm may also be implemented in an integrated cir-
cuit which may then be used in other types of equipment.

Although a preferred embodiment of the present invention 
has been described and shown, the invention is not re-
stricted to it, but may also be embodied in other ways
within the scope of the subject-matter defined in the
following claims.

Thus, the autocorrelation function may be calculated di-
rectly from the speech signal instead of the residual sig-
nal, or other conformity functions may be used instead of
the autocorrelation function. As an example, a cross cor-
relation could be calculated between the speech signal
and the residual signal. It is also possible to repeat
the autocorrelation, i.e. to calculate the autocorrela-
tion of the result of the first autocorrelation, before
detecting peaks.

Further, different sampling rates and sizes of the seg-
ments may be used.



